# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources and connections, and is used by platform APIs.
from project_lib import Project
project = Project(project_id='...', project_access_token='...')
Using the IBM Debater® Thematic Clustering of Sentences dataset, you will explore the data, then use it to create a model that dynamically groups sentences by their main topics and themes. This could be used in an application that collects customer feedback, to help organize the comments automatically.
In this first notebook, you will load, explore, clean and visualize the data. You will then save the cleaned dataset to the Watson Studio project as a data asset to be loaded in Part 2 - Model Development to evaluate a K-Means clustering model.
The dataset contains 692 articles from Wikipedia, where the number of sections (clusters) in each article ranges from 5 to 12, and the number of sentences per article ranges from 17 to 1614.
Before you run this notebook, complete the following steps:
When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources and connections, and is used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:
Click **More -> Insert project token** in the top-right menu section. This should insert a cell at the top of this notebook similar to the example given above.
If an error is displayed indicating that no project token is defined, follow these instructions.
Run the newly inserted cell before proceeding with the rest of the notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
This notebook uses one file from the IBM Debater® Thematic Clustering of Sentences dataset, named `dataset.csv`. The function below sets the path for the data, then loads and reads the dataset that is already imported into the Watson Studio project as a data asset.
# Define get data file function
def get_file_handle(fname):
    # Project data path for the raw data file
    data_path = project.get_file(fname)
    data_path.seek(0)
    return data_path
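`project.get_file` returns a file-like object, so `pandas` can read it directly. The same pattern can be sketched without Watson Studio by substituting an in-memory handle (illustration only; `project_lib` is not needed here):

```python
import io

import pandas as pd

# Simulate the file-like handle that project.get_file would return
handle = io.StringIO("Article Title,Sentence\nMoeller High School,Hello world.")
handle.seek(0)  # rewind, just as get_file_handle does

df = pd.read_csv(handle)
print(df.shape)  # (1, 2)
```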
This file contains 692 articles from Wikipedia, where the number of sections (clusters) in each article ranges from 5 to 12, and the number of sentences per article ranges from 17 to 1614. Each row in the dataset is a `Sentence`, which belongs to a `SectionTitle`, and each `SectionTitle` belongs to an `Article Title`. The column `Article Link` is the original source of the sentence.
# Define filename
DATA_PATH = 'dataset.csv'
# Use pandas to read the data
data_path = get_file_handle(DATA_PATH)
clustering_df = pd.read_csv(data_path)
clustering_df.head()
Article Title | Sentence | SectionTitle | Article Link | |
---|---|---|---|---|
0 | Moeller High School | Moeller's student-run newspaper, The Crusader,... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School |
1 | Moeller High School | In 2008, The Crusader won First Place, the sec... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School |
2 | Moeller High School | The Squire is a student literary journal that ... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School |
3 | Moeller High School | Paul Keels - play-by-play announcer for Ohio S... | Notable alumni | https://en.wikipedia.org/wiki/Moeller_High_School |
4 | Moeller High School | Joe Uecker - Ohio State Senator (R-66) . | Notable alumni | https://en.wikipedia.org/wiki/Moeller_High_School |
In order for this data to be used to evaluate a clustering model, clusters need to be assigned. According to the readme file of the dataset (found in the original dataset zip here), each cluster corresponds to a `SectionTitle`. That is, every sentence with the same section title is in the same cluster. Thus, you can combine the `Article Title` and `SectionTitle` to get a unique group.
Two columns are added to the dataset to more easily show the clusters by giving each cluster a unique label:

- `label` is a unique string
- `label_id` is a unique number

clustering_df['label'] = clustering_df.apply(lambda row: row['Article Title'].strip().replace(" ", "_") + ":" + row['SectionTitle'].strip().replace(" ", "_"), axis=1)
clustering_df['label_id'] = clustering_df.label.astype('category').cat.codes
clustering_df.head()
Article Title | Sentence | SectionTitle | Article Link | label | label_id | |
---|---|---|---|---|---|---|
0 | Moeller High School | Moeller's student-run newspaper, The Crusader,... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
1 | Moeller High School | In 2008, The Crusader won First Place, the sec... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
2 | Moeller High School | The Squire is a student literary journal that ... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
3 | Moeller High School | Paul Keels - play-by-play announcer for Ohio S... | Notable alumni | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:Notable_alumni | 3413 |
4 | Moeller High School | Joe Uecker - Ohio State Senator (R-66) . | Notable alumni | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:Notable_alumni | 3413 |
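The label construction above can be traced on a tiny synthetic frame (the column names match the dataset; the values are made up):

```python
import pandas as pd

demo = pd.DataFrame({
    'Article Title': ['A School', 'A School', 'B Town'],
    'SectionTitle':  ['History', 'Alumni', 'History'],
})

# Same transformation as in the notebook: strip, replace spaces, join with ':'
demo['label'] = demo.apply(
    lambda row: row['Article Title'].strip().replace(" ", "_")
                + ":" + row['SectionTitle'].strip().replace(" ", "_"),
    axis=1)

# cat.codes assigns one integer per unique label (categories sort alphabetically)
demo['label_id'] = demo.label.astype('category').cat.codes
print(demo[['label', 'label_id']])
```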
Create a dictionary mapping the label ID to the label name.
id_to_category = dict(enumerate(clustering_df.label.astype('category').cat.categories))
Looking at the number of sentences that correspond to each cluster (label), you can see that one cluster has a lot more sentences.
# One group has a lot more sentences.
clustering_df.label_id.value_counts()
32      1308
27       164
30       126
4240      94
3013      91
        ...
2695       3
5105       3
979        3
3026       3
1365       3
Name: label_id, Length: 5555, dtype: int64
id_to_category[32]
'1980_Birthday_Honours:United_Kingdom_and_Colonies'
Remove this cluster from the dataset so that the remaining groups are more evenly sized when you test the model in the second notebook. One very large group may not be an accurate representation of real data.
# Remove rows in that top category
top_id = clustering_df.label_id.value_counts().index[0]
df = clustering_df.loc[(clustering_df.label != id_to_category[top_id])]
df.head()
Article Title | Sentence | SectionTitle | Article Link | label | label_id | |
---|---|---|---|---|---|---|
0 | Moeller High School | Moeller's student-run newspaper, The Crusader,... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
1 | Moeller High School | In 2008, The Crusader won First Place, the sec... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
2 | Moeller High School | The Squire is a student literary journal that ... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
3 | Moeller High School | Paul Keels - play-by-play announcer for Ohio S... | Notable alumni | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:Notable_alumni | 3413 |
4 | Moeller High School | Joe Uecker - Ohio State Senator (R-66) . | Notable alumni | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:Notable_alumni | 3413 |
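The removal step reduces to two operations: find the most frequent label via `value_counts()` (which sorts in descending order of frequency), then filter it out. A self-contained sketch with made-up labels:

```python
import pandas as pd

toy = pd.DataFrame({'label': ['big'] * 5 + ['a'] * 2 + ['b'] * 2})

# value_counts() sorts by frequency, so index[0] is the largest cluster
top = toy.label.value_counts().index[0]
filtered = toy.loc[toy.label != top]

print(top, len(filtered))  # big 4
```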
Next, set the features to be `Sentence`, which is all the text data we are interested in. You will be predicting the `label_id` with the model in the second notebook. Below you can see that there are 5554 clusters (1 removed), and on average about 8 sentences in each cluster.
X = df.Sentence
y = df.label_id
print('Total data rows: ', len(X))
print('Unique groups: ', len(y.unique()))
print('Average number of rows per group: ', clustering_df.label_id.value_counts().mean())
Total data rows:  44809
Unique groups:  5554
Average number of rows per group:  8.301890189018902
To test a model, break this dataset into smaller datasets, because in the real world you would likely not have 5000 unique clusters. Split the data so that each set has about 5 clusters: randomly sample 5000 of the 5554 cluster IDs, then split them into 1000 sets. Now there are 1000 sets to test on (`list_of_groups`).
np.random.seed(42) # get reproducible results
number_of_groups = 1000
sampled_categories = np.random.choice(y.unique(), size=5000)
list_of_groups = np.split(sampled_categories, number_of_groups) # 5 categories in each group
# Convert list_of_groups to a DataFrame to save to the project
groups_of_themes = pd.DataFrame(pd.Series(np.array(list_of_groups).tolist()), columns=['group'])
groups_of_themes.head()
group | |
---|---|
0 | [2822, 1492, 2014, 4508, 4393] |
1 | [535, 2896, 3550, 1670, 2837] |
2 | [739, 659, 1015, 1362, 3938] |
3 | [4167, 4753, 1516, 1386, 1705] |
4 | [3029, 3826, 3057, 3969, 5299] |
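The splitting step relies on `np.split`, which requires the array length to be evenly divisible by the number of groups (5000 / 1000 = 5 here), and raises an error otherwise. Note also that `np.random.choice` samples with replacement unless `replace=False` is passed, so the sampled IDs above may contain repeats. A minimal standalone sketch with smaller, made-up numbers:

```python
import numpy as np

np.random.seed(42)  # reproducible, as in the notebook

ids = np.arange(20)  # stand-in for the unique label IDs
sampled = np.random.choice(ids, size=10, replace=False)  # no repeats
groups = np.split(sampled, 2)  # two groups of 5; 10 / 2 divides evenly

print([g.tolist() for g in groups])
```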
Each theme (cluster) has about 8 sentences on average.
df.label_id.value_counts().describe()
count    5554.000000
mean        8.067879
std         8.506450
min         3.000000
25%         4.000000
50%         5.000000
75%         9.000000
max       164.000000
Name: label_id, dtype: float64
You want a relatively uniform distribution of themes, since the test sets should contain roughly equal-sized clusters. The histogram shown below suggests that the distribution is roughly uniform.
ax = df.label_id.hist()
ax.set_xlabel('Theme ID')
ax.set_ylabel('Count')
ax.set_title('Distribution of Themes')
Text(0.5, 1.0, 'Distribution of Themes')
Next, look at how many words are included in each theme label (`label`). On average, the theme labels are about 4 words long, and the longest is 20 words. Fifty percent of the labels are four words or fewer.
df['label'].str.split('_').apply(len).describe()
count    44809.000000
mean         4.396148
std          2.577010
min          1.000000
25%          3.000000
50%          4.000000
75%          6.000000
max         20.000000
Name: label, dtype: float64
ax = df['label'].str.split('_').apply(len).hist()
ax.set_xlabel('Number of Words in Theme Label')
ax.set_ylabel('Count');
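The word count above works because label words were joined with underscores when the labels were built, so splitting on `_` recovers them. One subtlety: the `:` separator between article and section is not split on, so the two tokens around it count as a single word:

```python
import pandas as pd

labels = pd.Series(['Moeller_High_School:Notable_alumni'])

# Splits into ['Moeller', 'High', 'School:Notable', 'alumni'] -> 4 words
n = labels.str.split('_').apply(len).iloc[0]
print(n)  # 4
```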
On average, the sentences have about 21 words. To test the model in the second notebook, you would not want only very short or very long sentences, since those are unlikely in real comments. About 21 words is close to a typical average sentence length. The histogram below also shows that the word counts are skewed to the right (more sentences are shorter rather than longer).
df['Sentence'].str.split().apply(len).describe()
count    44809.000000
mean        21.846995
std          9.967591
min          5.000000
25%         14.000000
50%         20.000000
75%         28.000000
max         50.000000
Name: Sentence, dtype: float64
ax = df['Sentence'].str.split().apply(len).hist()
ax.set_xlabel('Number of Words in Each Sentence')
ax.set_ylabel('Count');
Finally, save the cleaned dataset as a project asset for later reuse. If successful, you should see output like the following:
{'file_name': 'themes.csv',
'message': 'File saved to project storage.',
'bucket_name': 'ibmdebaterthematicclusteringofsen...',
'asset_id': '...'}
and
{'file_name': 'groups_of_themes.csv',
'message': 'File saved to project storage.',
'bucket_name': 'ibmdebaterthematicclusteringofsen...',
'asset_id': '...'}
Note: In order for this step to work, your project token (see the first cell of this notebook) must have the **Editor** role. By default, this will overwrite any existing file.
project.save_data("themes.csv", df.to_csv(index=False, float_format='%g'), overwrite=True)
project.save_data("groups_of_themes.csv", groups_of_themes.to_csv(index=False), overwrite=True)
Continue with the **Part 2 - Model Development** notebook to explore the cleaned dataset.

This notebook was created by the Center for Open-Source Data & AI Technologies.